ABSTRACT

Such interactive, distributed multimedia applications as shared
whiteboards, group editors, and simulations require reliable concurrent
multicast services, i.e., the reliable dissemination of information
from multiple sources to all the members of a group. Furthermore, it
makes sense to offer that service on top of the increasingly available
IP multicast service, which offers unreliable multicasting. This paper
establishes that concurrent reliable multicasting over the Internet
should be built on reliable multicast protocols that use a shared
acknowledgment tree. First, we show that organizing the receivers of a
reliable multicast group into an acknowledgment tree and using
NAK-avoidance with periodic polling in local groups inside such a tree
provides the highest maximum throughput among all classes of reliable
multicast protocols proposed to date. Second, we introduce Lorax, which
demonstrates the viability of implementing a reliable multicasting
approach in the Internet based on acknowledgment trees in a scalable
manner. Lorax is the first known protocol that constructs and maintains
a single acknowledgment tree for reliable concurrent multicasting,
eliminates the need to maintain an acknowledgment tree for each source
of a reliable multicast group, and can be used in combination with any
of several tree-based reliable multicast protocols proposed to date.
1  INTRODUCTION

Interactive, distributed multimedia applications like shared
whiteboards, group editors, and simulations require a reliable
concurrent multicasting service. Such a service consists of
disseminating information from multiple sources to all members of a
multicast group, such that (a) every packet from each source is
delivered to each receiver within a finite time, free of errors, with
no duplicates, and in the order sent by the source; and (b) nodes
responsible for retransmitting packets can delete packets from memory
within a finite time.

The development and implementation of end-to-end protocols for
reliable concurrent multicasting over the Internet is being enabled
by the increasing availability of multicast routing in Internet routers.
IP-Multicast routers permit sources to transmit data unreliably to
multiple receivers [1]. The most critical challenge for the successful
development and implementation of end-to-end reliable protocols
built on top of IP multicast consists of avoiding the acknowledgment
(ACK) implosion problem in large multicast groups: in a very large
reliable multicast session, the sources may be overwhelmed by the
amount of work required to process the acknowledgments sent by
the large receiver set.

A considerable amount of work has been reported in the recent past
on how to cope with or eliminate the ACK-implosion problem [2]-[16].
However, the design of reliable multicast protocols is complex,
and there is no consensus yet on which is the best approach for
the implementation of protocols for scalable, reliable concurrent
multicasting over the Internet. This paper makes the case that
end-to-end reliable concurrent multicasting over the Internet should be
built on protocols that use a shared acknowledgment tree. We
establish our case in three parts.

First, in Section 2, we summarize the known classes of protocols
that have been proposed for end-to-end reliable multicasting. In
Section 3, we use this taxonomy and an approximate model to analyze
the maximum throughput of these protocol classes. Our analysis
shows that the tree-NAPP protocol class is the most scalable
approach with respect to the number of receivers and provides the
highest maximum throughput among all reliable multicast protocol
classes proposed to date.¹ In a tree-NAPP protocol, the receivers
of a reliable multicast group are organized into an acknowledgment
tree (ACK tree) built on top of the multicast routing tree(s) provided
by such multicast routing protocols as DVMRP [17], PIM [18],
CBT [19], or OCBT [20]. A source multicasts packets to all the
receivers through the multicast routing tree, and responsibility for
retransmissions is delegated to the receivers. Retransmissions take
place only in local groups of the ACK tree, and the amount of
ACK traffic within each local group is reduced by means of
NAK-avoidance with periodic polling.

Second,  Section  4  presents  a  simple  extension  of  any  ACK  tree- 
based  reliable  multicast  protocol.  This  extension  allows  the  source 
to  safely  deallocate  packets  from  memory  when  the  ACK  tree  needs 
to  be  modified. 


¹ These results are consistent with the experimental results reported in [6].
Finally, we note that it is not reasonable to set up an ACK tree
for every source in a concurrent multicast session, and that the
ACK tree should adapt to changes in the constituency of either the
receiver set or the multicast routing tree(s). Section 5 completes our
case by describing Lorax [21], which is the first known protocol
that constructs and maintains a single shared ACK tree for reliable
concurrent multicasting. Lorax eliminates the need to maintain
an ACK tree for each source of a reliable multicast group, and
can be used in combination with any of several tree-based reliable
multicast protocols proposed to date (e.g., [5, 6]).

Section 6 compares our approach with related work and discusses
why Lorax and tree-NAPP protocols are the best approach to date
for the provision of scalable, reliable concurrent multicasting
services in the Internet. Conclusions and directions for future work are
offered in Section 8.

2  BACKGROUND 

To provide a summary of known classes of reliable multicast
protocols, we use a taxonomy that decouples the definition of the
mechanisms needed for the pacing of data transmission from the
mechanisms needed for the allocation of memory at the source [10]. Each
protocol class can be viewed as using two windows: a congestion
window (cw) that advances based on feedback from receivers
regarding the pacing of transmissions and detection of errors, and
a memory allocation window (mw) that advances based on feedback
from receivers as to whether the sender can erase data from
memory. In practice, protocols may use a single window for pacing
and memory allocation (e.g., TCP [22]) or separate windows (e.g.,
NETBLT [23]). In all classes, packets are multicast unreliably from
the source directly to all receivers. The protocol classes differ in
how acknowledgments flow from the receivers back to the source.
A more detailed description of all generic protocols can be found
in [10].
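The two-window view can be illustrated with a small Python sketch. It is only a toy model under stated assumptions: the class name `SenderBuffer`, its methods, and the rule that the mw never passes the cw are hypothetical choices made for illustration, not part of any protocol cited here.

```python
class SenderBuffer:
    """Toy sender state with separate congestion (cw) and memory (mw) windows.

    Hypothetical illustration: cw_base is the oldest packet still awaiting
    pacing/error feedback, mw_base is the oldest packet still held in memory.
    """

    def __init__(self):
        self.packets = {}    # seq -> payload still held in memory
        self.next_seq = 0
        self.cw_base = 0     # advanced by pacing and error feedback
        self.mw_base = 0     # advanced by "safe to erase" feedback

    def send(self, payload):
        seq = self.next_seq
        self.packets[seq] = payload
        self.next_seq += 1
        return seq

    def advance_cw(self, seq):
        """Feedback says packets below seq need no further pacing."""
        self.cw_base = max(self.cw_base, min(seq, self.next_seq))

    def advance_mw(self, seq):
        """Feedback says packets below seq may be erased from memory.

        Assumption of this sketch: memory is never released ahead of the cw.
        """
        new_base = max(self.mw_base, min(seq, self.cw_base))
        for s in range(self.mw_base, new_base):
            del self.packets[s]
        self.mw_base = new_base
```

Advancing the cw alone leaves every packet in memory; only feedback that moves the mw frees buffer space, which is the distinction the taxonomy relies on.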

2.1  Sender-Initiated  Protocols 

A sender-initiated reliable multicast protocol is one that requires the
source to receive ACKs from all members of a known receiver set
before it is allowed to release memory for the data associated with the
ACKs. The receivers are not organized into any structure, and may
contact the source directly. An example of this type of protocol
is presented in [13]. It is well known that this scheme suffers from the
ACK-implosion problem. Whether the source or the receivers are in
charge of pacing the source and scheduling retransmissions is
unimportant for our taxonomy. In other words, regardless of whether a
sender-based or receiver-based retransmission strategy is used, the
source is still in charge of deallocating memory after receiving all
the ACKs for a given packet or set of packets. The use of NAKs
can shorten retransmission latency, but is not necessary for
protocol correctness. The main limitation of sender-initiated
protocols is not that ACKs are used, but rather the need for the source to
process all of the ACKs and to know the receiver set.

2.2  Receiver-Initiated  Protocols 

The critical aspect of receiver-initiated protocols for our taxonomy
is that no ACKs are used. The receivers send NAKs back to the
source when a retransmission is needed, detected by either an
error, a skip in the sequence numbers used, or a timeout. Because the
source receives feedback from receivers only when packets are lost
and not when they are delivered, the source is unable to ascertain
when it can safely release data from memory. There is no explicit
mechanism in a receiver-initiated protocol for the source to release
data from memory (i.e., advance the mw), even though its pacing
and retransmission mechanisms are scalable and efficient (i.e.,
advancing the cw).
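Loss detection by a skip in sequence numbers can be sketched as follows. This is a minimal Python illustration, assuming integer sequence numbers starting at 0; the class `NakReceiver` and its method names are hypothetical. Note that nothing is returned to the source when a packet arrives in order, which is why the source cannot tell when to advance the mw.

```python
class NakReceiver:
    """Toy receiver that issues NAKs only when it sees a sequence gap."""

    def __init__(self):
        self.next_expected = 0   # lowest sequence number not yet seen
        self.missing = set()     # gaps still waiting for a retransmission

    def on_packet(self, seq):
        """Process an arriving sequence number; return the NAKs it triggers."""
        if seq in self.missing:
            self.missing.discard(seq)    # a retransmission filled a gap
            return []
        if seq < self.next_expected:
            return []                    # duplicate, nothing to request
        gap = list(range(self.next_expected, seq))
        self.missing.update(gap)         # every skipped number is NAKed
        self.next_expected = seq + 1
        return gap
```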

Because the source may experience NAK-implosion if many receivers
detect transmission errors, previous work on receiver-initiated
protocols ([8, 11]) adopts the NAK-avoidance scheme first
proposed in [2]: upon detection of a lost packet, receivers
schedule a NAK for a random time in the near future. During that time
the receiver listens for a NAK by another multicast group member
for the same packet. If another NAK is heard, the receiver's own NAK
is rescheduled for a subsequent time. It is hoped that only one NAK
is sent by the whole group to the parent for a lost packet. We
refer to this protocol subclass as RINA (for Receiver-Initiated with
NAK-Avoidance). The scalable reliable multicasting (SRM)
protocol [11] and the "log-based receiver-reliable multicast" (LBRM)
protocol [12] are examples of RINA protocols.
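The effect of NAK-avoidance on feedback volume can be illustrated with a one-round simulation in Python. This is a sketch under simplifying assumptions that are not taken from the cited protocols: timers are drawn uniformly, every NAK reaches every other receiver after the same fixed `delay`, and only the first round of timers is modeled. The function name and parameters are hypothetical.

```python
import random


def nak_avoidance_round(num_losers, max_backoff, delay, rng):
    """Count the NAKs multicast for one lost packet in one timer round.

    Each of num_losers receivers draws a timer uniformly in [0, max_backoff).
    A receiver sends its NAK unless a NAK from another receiver has already
    reached it; a NAK takes `delay` time units to reach the group.
    """
    timers = sorted(rng.uniform(0, max_backoff) for _ in range(num_losers))
    if not timers:
        return 0
    first = timers[0]
    # Only timers that expire before the first NAK is heard still fire.
    return sum(1 for t in timers if t <= first + delay)
```

With a negligible propagation delay a single NAK suppresses all others; as the delay approaches the backoff interval, suppression stops working and every receiver that lost the packet sends a NAK.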

2.3  Ring-Based  Protocols 

Our generic description of ring-based protocols is based on the
Reliable Multicast Protocol (RMP) [7], which is based on the Token
Ring Protocol (TRP) [3].

Ring-based  protocols  work  by  organizing  the  receivers  into  a  ring, 
with  a  rotating  token  site  designated  as  the  only  node  to  ACK  back  to 
the  source  for  the  current  packet.  The  source  deletes  packets  only 
when  an  ACK/token  is  received.  The  ACK  also  serves  to  pass  the 
token  and  to  timestamp  packets,  so  that  all  receiver  nodes  have  a 
global  ordering  of  the  packets  for  delivery  to  the  application  layer. 
Receivers  send  NAKs  to  the  token  site  for  selective  repeat  of  lost 
packets  that  were  originally  multicast  from  the  source.  Both  TRP 
and  RMP  specify  that  retransmissions  are  sent  unicast  from  the 
token  site.  The  token  is  not  passed  to  the  next  member  of  the  ring  of 
receivers  until  the  new  site  has  correctly  received  all  packets  that  the 
former  site  has  received.  Once  the  token  is  passed,  a  site  may  clear 
packets  from  memory.  We  can  characterize  ring-based  protocols  as 
placing  the  token  site  in  control  of  the  mw  (conditional  on  passing 
the  token),  and  placing  control  of  the  cw  with  either  the  token  site 
or  the  source. 

2.4  Tree-Based  Protocols 

Tree-based protocols designate three types of nodes over a
preconstructed ACK tree during reliable transmission: source, hop, and
leaf nodes. Source nodes multicast to the entire receiver set and are
responsible to at most B children for retransmissions of data. Leaf
nodes are strictly receivers and have no children in the ACK tree.
Hop nodes are intermediate nodes between the source and leaves,
responsible for requesting lost data from a parent hop node and for
retransmitting data requested by at most B children. A hop node
and its children constitute a local group, with the hop node as the
group leader. Note that leaf nodes are essentially hop nodes with
no children, and source nodes are hop nodes with no parent.

Because ACKs are sent to the parent node and not the source, we
refer to them as hierarchical acknowledgments (HACKs). In our
generic tree-based protocol, a node sends a HACK to its parent as
soon as it receives a packet correctly (in order to move the cw), not
when all its own children (if any) have sent their HACKs. If the
source had to wait for ACKs to be aggregated all the way from the
leaf nodes, it would have to be paced based on the slowest tree path.
In the tree-based protocols proposed to date ([4, 5, 6]) the cw is
advanced by HACKs, but there is no provision for deleting packets
and advancing the mw safely. Section 4 addresses this important
point in more detail, providing an extension to the tree-based class
that allows the source to safely delete packets and advance the mw.

We assume that the source and group leaders control the
retransmission timeouts; however, such timeouts can be controlled by the
children of the source and group leaders. Accordingly, when the
source sends a packet, it sets a timer, and each hop node sets a timer
as it becomes aware of a new packet. If there is a timeout before
all HACKs have been received, the packet is assumed to be lost and
is retransmitted by the source or group leader to its children. We
assume a selective repeat strategy is used, so that once a packet is
received correctly, it is never rebroadcast to the local group again.
Several tree-based protocol possibilities are discussed in [4], and
have been fully developed as the Reliable Multicast Transport
Protocol (RMTP) [5].
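The timer-driven behavior of a source or group leader described above can be sketched in Python. This is an illustrative toy, not RMTP or any other cited protocol: the class `GroupLeader`, its method names, and the use of an abstract integer clock are assumptions made for the example.

```python
class GroupLeader:
    """Toy source or hop node that tracks HACKs from its local group."""

    def __init__(self, children, timeout):
        self.children = set(children)
        self.timeout = timeout
        self.pending = {}   # seq -> (deadline, children that have not HACKed)

    def on_new_packet(self, seq, now):
        """Start a timer when the node becomes aware of a new packet."""
        self.pending[seq] = (now + self.timeout, set(self.children))

    def on_hack(self, seq, child):
        """Record a HACK; stop tracking once every child has answered."""
        if seq in self.pending:
            _, waiting = self.pending[seq]
            waiting.discard(child)
            if not waiting:
                del self.pending[seq]

    def on_tick(self, now):
        """Return the packets to retransmit to the local group, re-arming timers."""
        expired = [s for s, (deadline, _) in self.pending.items() if now >= deadline]
        for s in expired:
            _, waiting = self.pending[s]
            self.pending[s] = (now + self.timeout, waiting)
        return sorted(expired)
```

A packet whose HACKs all arrive before the deadline is never retransmitted, which mirrors the selective repeat assumption in the text.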

2.5  Tree-NAPP  Protocols 

Tree-NAPP protocols are a subclass of tree-based protocols. The
utilization of NAK-avoidance and periodic polling described in [2]
by the local groups in a tree-based protocol defines this subclass.
NAKs alone are not sufficient to guarantee reliability with finite
memory, so receivers send a periodic positive (hierarchical)
acknowledgment to their parents so that the cw may be advanced.
Note that the setting of timers needed for NAK avoidance is done
entirely on the local group scale, so it is scalable.

An implementation of tree-NAPPing can be found in the Tree-based
Multicast Transport Protocol (TMTP) [6]. One approach
to implement NAK-avoidance within a local group of an ACK tree
consists of using a multicast address for each local group of the
ACK tree.

3  MAXIMUM  THROUGHPUT  ANALYSIS 

To  analyze  the  relative  maximum  throughput  of  reliable  multicast 
protocols,  we  continue  to  use  the  same  model  used  in  [10]  and  first 
introduced  in  [8],  which  focuses  on  the  processing  requirements  of 
generic  reliable  multicast  protocols,  rather  than  the  communication 
bandwidth  requirements.  Accordingly,  the  maximum  throughput  of 
a  generic  protocol  is  a  function  of  the  per-packet  processing  rate  at 
the  sender  and  receivers,  and  the  analysis  focuses  on  obtaining  the 
processing  times  per  packet  at  a  given  node. 

We assume a single sender, X, multicasting a constant stream of
packets to R identical receivers. For clarity, we assume a single
ACK tree rooted at the source. All loss events at any node in the
multicast are mutually independent, the probability of packet loss is
p for any node, and no ACK is ever lost.

Following  the  notation  in  [8]  and  [10],  we  place  a  superscript  H2  on 
any  variables  relating  to  the  generic  tree-NAPP  protocol.  Additional 
notation  and  variables  are  introduced  as  needed  in  the  analysis; 
Figure  1  is  a  complete  list  of  all  variables  used  in  this  paper  for 
quick  reference.  The  following  paragraphs  derive  the  maximum 
throughput  for  tree-based  protocols  with  local  NAPP;  the  maximum 
throughputs  for  the  rest  of  the  classes  are  derived  in  [8,  10]. 

Assuming  a  finite  amount  of  memory  at  every  node,  it  is  easy  to 
show  [10]  that  the  generic  sender-initiated,  ring-based,  and  tree- 
based  protocols  are  free  of  deadlocks  and  deliver  packets  reliably, 
while  RINA  protocols  incur  deadlocks.  Table  2  summarizes  the 
results  on  maximum  throughput  and  correctness  reported  in  [10], 
together  with  the  tree-NAPP  throughput  result  derived  next. 

3.1  Throughput  of  Tree-NAPP  Protocol 

To  bound  the  overall  system  throughput  in  the  generic  tree-NAPP 
protocol,  we  first  derive  and  bound  the  expected  cost  at  the  source, 
hop,  and  leaf  nodes.  To  make  use  of  symmetry,  we  assume,  without 
loss  of  generality  that  there  are  enough  receivers  to  form  a  full  tree 
at  each  level. 

3.1.1  Source node  We consider first $X^{H2}$, the processing
cost required by the source to successfully multicast an arbitrarily
chosen packet to all receivers using the H2 protocol. The processing
requirement for an arbitrary packet can be expressed as a sum
of costs:

$$X^{H2} = \text{(initial transmission)} + \text{(retransmissions)} + \text{(receiving NAKs)} + \text{(receiving periodic HACKs)}$$

$$X^{H2} = X_f + \sum_{i=1}^{M} X_p(i) + \sum_{m=2}^{M} X_n(m) + B X_\phi, \tag{1}$$

where $X_f$ is the time to get a packet from a higher layer, $X_p(i)$
is the time for (re)transmission attempt $i$, $X_n(m)$ is the time for
receiving NAK $m$ from the receiver set, $X_\phi$ is the amortized time to
process the periodic HACK associated with the current congestion
window, and $M$ is the number of transmission attempts the source
will have to make for this packet. Taking expectations, we have

$$E[X^{H2}] = E[X_f] + E[M]\,E[X_p] + (E[M] - 1)\,E[X_n] + B\,E[X_\phi] \tag{2}$$

Following our previous analysis for tree-based protocols [10], we
derive the value of $M$, given that the source has a local receiver
subset of size $B$ from which to collect NAKs and to which it
retransmits packets. The expected number of transmissions per packet is [2, 8]

$$E[M] = \sum_{i=1}^{B} \binom{B}{i} (-1)^{i+1} \frac{1}{1 - p^i} \tag{3}$$
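As a numerical check on the expected number of transmissions, the following Python sketch evaluates $E[M]$ in two ways: with the alternating binomial sum, and by summing the tail probabilities $P(M > k) = 1 - (1 - p^k)^B$ that follow from independent losses with probability $p$ at each of $B$ receivers. The function names are hypothetical, and the sketch assumes $0 < p < 1$.

```python
from math import comb


def expected_transmissions(B, p):
    """E[M] as an alternating binomial sum over the B local receivers."""
    return sum(comb(B, i) * (-1) ** (i + 1) / (1 - p ** i)
               for i in range(1, B + 1))


def expected_transmissions_tail(B, p, tol=1e-12):
    """Cross-check: E[M] = sum over k >= 0 of P(M > k).

    M is the maximum of B independent geometric counts, so
    P(M > k) = 1 - (1 - p**k)**B.
    """
    total, k = 0.0, 0
    while True:
        term = 1.0 - (1.0 - p ** k) ** B
        if term < tol:
            return total
        total += term
        k += 1
```

For a single receiver both functions reduce to the familiar geometric mean $1/(1-p)$; for larger groups the two computations should agree to within the truncation tolerance.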

It is shown in [9] that $\frac{H_B}{-\ln p} < E[M] < 1 + \frac{H_B}{-\ln p}$, where
$H_B = \sum_{i=1}^{B} 1/i$ denotes the harmonic numbers. From the known inequality
$\ln(1 + p) > \frac{p}{1 + p}$, it follows that $-\ln p < \frac{1 - p}{p}$. Using this result,
assuming all operations (e.g., $X_f$ and $X_p$) are of constant cost, and
taking into account that $H_B \in O(\ln B)$, a bound on $E[M]$ is derived
in [8]; it is referred to below as Eq. 4.


$B$ - Branching factor of a tree, the group size.
$R$ - Size of the receiver set.
$X_f$ - Time to feed in a new packet from the higher protocol layer.
$X_p$ - Time to process the transmission of a packet.
$X_a, X_n, X_h$ - Times to process transmission of an ACK, NAK, or HACK.
$X_t, Y_t$ - Times to process a timeout at a sender or receiver node, respectively.
$Y_p$ - Time to process a newly received packet.
$Y_f$ - Time to deliver a correctly received packet to a higher layer.
$Y_n, Y_h$ - Times to transmit a NAK or HACK, respectively.
$X_\phi, Y_\phi$ - Times to process the reception and transmission, respectively, of a periodic HACK.
$p$ - Probability of loss at a receiver; losses at different receivers are assumed to be independent events.
$M_r$ - Number of transmissions necessary for receiver $r$ to successfully receive a packet.
$M$ - Number of transmissions for all receivers to receive the packet correctly; $M = \max_r\{M_r\}$.
$X^{H2}, Y^{H2}$ - The processing time per packet at the sender and receiver, respectively, in protocol H2.
$H^{H2}$ - Processing time per packet at a hop node in tree-based protocols.
$\Lambda_x^w$ - Throughput for protocol $w \in \{A, N1, N2, R, H1, H2\}$, where $x$ is one of the source $s$, receiver (leaf) $r$, or hop node $h$. No subscript denotes overall system throughput.

Figure 1: Notation
Figure 2: Analytical bounds and results on correctness.
Using Eq. 4, we can bound Eq. 2, which yields Eq. 5.
It then follows that when $p$ is a constant, $E[X^{H2}] \in O(1)$.

3.1.2  Leaf nodes  Let $Y^{H2}$ denote the requirement on nodes
that do not have to forward packets (leaves). Let $Y_p(i)$ be the time
it takes to process (re)transmission $i$, $Y_n(i)$ be the time it takes
to send NAK $i$, $X_n(i)$ be the time it takes to receive NAK $i$ (from
another receiver), $Y_t(i)$ be the time to set the $i$th timer, $Y_f$ be the time
to deliver a packet to a higher layer, and $Y_\phi$ be the amortized cost of
sending a periodic HACK for a group of packets of which this packet
is a member.

$$Y^{H2} = \text{(receiving transmissions)} + \text{(sending periodic HACKs)} + \text{(sending NAKs)} + \text{(receiving NAKs)}$$

$$Y^{H2} = \sum_{i=1}^{M} (1 - p)\,Y_p(i) + Y_f + Y_\phi
 + \sum_{i=2}^{M} \left( \frac{Y_n(i)}{B} + \frac{(B - 1)\,X_n(i)}{B} \right)
 + \mathrm{Prob}\{M_r > 2\} \sum_{i=2}^{M_r - 1} Y_t(i) \tag{6}$$

Taking expectations of Eq. 6,

$$E[Y^{H2}] = E[M](1 - p)\,E[Y_p] + E[Y_f] + E[Y_\phi]
 + (E[M] - 1)\left( \frac{E[Y_n]}{B} + \frac{(B - 1)\,E[X_n]}{B} \right)
 + \mathrm{Prob}\{M_r > 2\}\,(E[M_r \mid M_r > 2] - 2)\,E[Y_t] \tag{7}$$

It follows from the distribution of $M_r$ that [8]

$$E[M_r \mid M_r > 1] = (2 - p)/(1 - p) \tag{8}$$

$$E[M_r \mid M_r > 2] = (3 - 2p)/(1 - p) \tag{9}$$

Therefore, noting that $\mathrm{Prob}\{M_r > 2\} = p^2$, we derive the
expected cost as

$$E[Y^{H2}] = E[M](1 - p)\,E[Y_p] + E[Y_f] + E[Y_\phi]
 + (E[M] - 1)\left( \frac{E[Y_n]}{B} + \frac{(B - 1)\,E[X_n]}{B} \right)
 + p^2 \left( \frac{3 - 2p}{1 - p} - 2 \right) E[Y_t] \tag{10}$$

Again, using the bound of $E[M]$ given in Eq. 4, we can bound Eq. 10
by

$$E[Y^{H2}] \in O\left( 1 + \frac{1 - p + p \ln B + p^2 (1 - 4p)}{1 - p} \right) \tag{11}$$

When $p$ is treated as a constant, $E[Y^{H2}] \in O(1)$.

3.1.3  Hop nodes  To evaluate the processing requirement at a
hop node, $h$, we note that a node caught between the source and a
node with no children has two jobs: to receive and to retransmit
packets. Because it is convenient, and because a hop node is both a
sender and receiver, we will express the costs in terms of $X$ and $Y$.
Our sum of costs is

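$$H^{H2} = \text{(receiving transmissions)} + \text{(sending periodic HACKs)} + \text{(receiving periodic HACKs)} + \text{(receiving NAKs)} + \text{(sending NAKs)} + \text{(retransmissions to children)}$$

$$H^{H2} = (1 - p) \sum_{i=1}^{M} Y_p(i) + Y_\phi + B X_\phi + Y_f
 + \sum_{i=2}^{M} \left( \frac{Y_n(i)}{B} + \frac{(B - 1)\,X_n(i)}{B} \right)
 + \mathrm{Prob}\{M_r > 2\} \sum_{i=2}^{M_r - 1} Y_t(i)
 + \sum_{i=2}^{M} \left( X_n(i) + X_p(i) \right) \tag{12}$$

Computing the expected value of $H^{H2}$,

$$E[H^{H2}] = (1 - p)\,E[M]\,E[Y_p] + E[Y_\phi] + B\,E[X_\phi] + E[Y_f]
 + (E[M] - 1)\left( \frac{E[Y_n]}{B} + \frac{(B - 1)\,E[X_n]}{B} \right)
 + p^2 \left( \frac{3 - 2p}{1 - p} - 2 \right) E[Y_t]
 + (E[M] - 1)\,(E[X_n] + E[X_p]) \tag{13}$$

In other words, the average cost at a hop node is the sum of the costs
at a source and at a leaf, less the cost of receiving the data from higher
layers and the cost of one transmission (the original one):

$$E[H^{H2}] = E[Y^{H2}] + E[X^{H2}] - E[X_f] - E[X_p] \tag{14}$$

Therefore, Eq. 13 can be bounded by

$$E[H^{H2}] \in O(E[Y^{H2}]) \cup O(E[X^{H2}])
 \in O\left( 1 + \frac{1 - p + p \ln B + p^2 (1 - 4p)}{1 - p} \right) \tag{15}$$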


Figure 3: Top: The maximum throughput for each protocol.
Bottom: Number of supportable receivers for each protocol.
The branching factor for trees is set at 10.


When $p$ is a constant, $E[H^{H2}] \in O(1)$. Therefore, all nodes in
the tree-NAPP protocol have a constant amount of work to do with
regard to the number of receivers.

3.1.4  Overall system  Let $\Lambda_s^{H2} = 1/E[X^{H2}]$,
$\Lambda_h^{H2} = 1/E[H^{H2}]$, and $\Lambda_r^{H2} = 1/E[Y^{H2}]$ equal the
throughput at the sender, hops, and leaves, respectively; then

$$\Lambda^{H2} = \min\{ \Lambda_s^{H2}, \Lambda_h^{H2}, \Lambda_r^{H2} \} \tag{16}$$

From Equations 5, 11, 15, and 16, it follows that

$$1/\Lambda^{H2} \in O\left( 1 + \frac{1 - p + p \ln B + p^2 (1 - 4p)}{1 - p} \right) \tag{17}$$

Accordingly, if either $p$ is a constant or $p \to 0$, we obtain from
Eq. 17 that $1/\Lambda^{H2} \in O(1)$. Therefore, the maximum throughput
of the tree-NAPP protocol, as well as the throughput with
non-negligible packet loss, is independent of the number of receivers.
Tree-based protocols are the only class of reliable multicast protocols
that exhibits such a degree of scalability with respect to the number
of receivers.


3.2  Numerical  Results 

To examine the relative performance of the various classes of
protocols, all mean processing times are set equal to 1, except for the
periodic costs $X_\phi$ and $Y_\phi$, which are set to 0.1. Figure 3(a) compares
the relative throughputs of the protocols described in Section 2. The
graph represents the inverse of Eq. 13 as the exact expected
throughput for tree-NAPP protocols, as well as the throughput equations
derived in [8, 10] for all other classes. The top, middle, and bottom
graphs correspond to increasing probabilities of packet loss: 1%,
10%, and 25%, respectively.
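The tree-NAPP curve can be approximated with a short Python sketch. It follows the terms of Eq. 13 as we read them (receiving transmissions, periodic HACKs, delivery, NAK handling, timers, and retransmissions to children), with unit mean processing times and periodic costs of 0.1 as stated above. The function names are hypothetical and the numbers it produces are not taken from Figure 3.

```python
from math import comb


def expected_transmissions(B, p):
    """E[M] for a local group of B receivers with loss probability p."""
    return sum(comb(B, i) * (-1) ** (i + 1) / (1 - p ** i)
               for i in range(1, B + 1))


def hop_cost(B, p, unit=1.0, periodic=0.1):
    """Expected per-packet processing time at a hop node (terms of Eq. 13)."""
    m = expected_transmissions(B, p)
    receive = (1 - p) * m * unit                        # receiving transmissions
    hacks = periodic + B * periodic                     # sending and receiving periodic HACKs
    deliver = unit                                      # delivery to the higher layer
    naks = (m - 1) * (unit / B + (B - 1) * unit / B)    # sending or overhearing NAKs
    timers = p ** 2 * ((3 - 2 * p) / (1 - p) - 2) * unit
    retransmit = (m - 1) * (unit + unit)                # NAKs from children plus retransmissions
    return receive + hacks + deliver + naks + timers + retransmit


def tree_napp_throughput(B, p):
    """Throughput taken as the inverse of the hop-node cost."""
    return 1.0 / hop_cost(B, p)
```

Because the cost depends only on $B$ and $p$, the computed throughput is the same for any receiver-set size, which is the scalability property the analysis claims.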

The throughputs of tree-based and tree-NAPP protocols are
independent of the size of the receiver set, and therefore any increase
in processor speed would directly increase throughput. A smaller
branching factor would also increase throughput at the cost of a
longer path that retransmissions must traverse to an expecting
receiver.

Figure 3(b) shows the number of supportable receivers by each
of the different classes, relative to processor speed requirements.
This number is obtained by normalizing all classes to a baseline
processor. As described in [8, 9], the baseline uses a sender-initiated
protocol and can support exactly one receiver. As in [8, 9], let
$\mu^w[R]$ be the speed of the processor that can support at most $R$
receivers under protocol $w$, where $w \in \{A, N1, N2, R, H1, H2\}$
represents sender-initiated, receiver-initiated, RINA, ring-based,
tree-based, and tree-NAPP, respectively. If we set $\mu^A[1] = 1$ as a
baseline, it is shown in [8] that
The speedup of tree-NAPP protocols can be calculated as the ratio
of their expected cost (Eq. 13) to the baseline
In [8, 10], the number of supportable receivers derived for sender-
and receiver-initiated, RINA, ring-based, and tree-based protocols
are shown to be
Because the exact value of $E[M]$ is difficult to compute for large
values of $R$, as in [8, 9], we use the following approximation

$$E[M] \approx \alpha + \frac{H_{35} - H_R}{\ln(p)}$$

where $\alpha$ is the value of $E[M]$ for $R = 35$ and $H_n$ is the $n$th
harmonic number. When evaluating $\mu^{H1}[R]$ or $\mu^{H2}[R]$, an exact value for
$E[M]$ is used because the number of receivers is always $R = B = 10$.


From Figure 3, it is clear that only the tree-based classes can
support any number of receivers for the same processor speed bound at
each node. It is also clear that, in terms of performance, tree-NAPP
protocols are superior to other classes. Of course, our model
constitutes only a crude approximation of the actual behavior of reliable
multicast protocols. In the Internet, an ACK or a NAK is simply
another packet, and the failure to deliver a given packet correctly to
a receiver is correlated with what happens at other receivers,
because packets are distributed along multicast routing trees.
Nevertheless, our approximate model is still a valuable tool as a
first-order comparison of reliable multicast protocols and produces
results that should be expected. Because tree-based protocols
delegate responsibility for retransmission to receivers and because they
employ techniques applicable to either sender- or receiver-initiated
protocols only within local groups of the ACK tree (i.e., a node and
its children in the tree), any mechanism that can be used in a
receiver-initiated protocol can be adopted in a tree-based protocol,
with the added benefit that the throughput and number of supportable
receivers are completely independent of the size of the receiver
set, regardless of the likelihood with which packets are received
correctly at the receivers. The rest of this paper describes how to
construct shared ACK trees in a scalable and fault-tolerant manner.

Figure 4: Packet loss for a subtree due to a hop node failure.

4  DEALLOCATING  MEMORY 

The  ACK  tree  structure  greatly  improves  throughput  over  other 
classes.  However,  an  inherent  weakness  in  the  basic  approach  of 
delegating  the  responsibility  of  retransmissions  to  group  leaders  is 
that  the  failure  of  one  or  more  hop  nodes  can  cause  an  entire  subtree 
to  lose  a  series  of  packets  during  re-establishment  of  the  tree. 

After node failure, applications could either (a) terminate the
session, (b) continue the session without providing a reliable service to
nodes while they are temporarily disconnected, or (c) allow nodes to
rejoin and catch up with the session. For the first two cases, changes
in the ACK tree do not create a problem, and the cw and mw can
advance together at the source and each hop node. The rest of this
section considers necessary extensions for the last case. Figure 4
illustrates the problem with an example. Packets are multicast from
the source to the receiver set; nodes that have received the data
correctly are shaded. Packets are acknowledged as they are received,
rather than waiting for the acknowledgments from children. Node
A, and all other nodes that have received HACKs from all their
children, delete packets; however, B fails before it is able to confirm
that all its children have correctly received the data. If we assume
that at least one child C has not received the data, then there is no
node with whom to re-establish contact that will definitely have a
copy of the data. One solution to this problem is to buffer the
entire session in a secondary store, as is done in RMTP and SRM.
However, this solution can become unscalable.

Ideally,  we  would  like  to  keep  data  in  a  finite  secondary  store 
only  until  all  current  receivers  have  correctly  received  the  data, 
without  having  any  node  keeping  track  of  who  all  the  receivers  are. 
Fortunately,  deadlocks  due  to  receiver  failures  or  reconfigurations  of 
the  ACK  tree  or  the  underlying  multicast  routing  tree  can  be  easily 
avoided in tree-based protocols by introducing aggregate ACKs that
propagate from the receivers up the tree to the source. The ACK sent
from a node to its parent in the tree consists of its own ACK and the
aggregated ACKs from all its children. Just as in the generic tree-based
protocol, correctly received data packets are acknowledged
using HACKs. However, packets are not deleted at this point; they
are kept in a secondary store or partition of memory. When a
parent of a leaf node confirms that all its children have correctly
received the data, it deletes the data from secondary store and sends
an aggregated ACK to its own parent. Hop nodes follow the same
procedure. In terms of our taxonomy, aggregate ACKs are used to
move the mw, and HACKs (and negative HACKs) to move the cw.
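
The bookkeeping described above can be sketched in Python as follows. This is only an illustrative model under stated assumptions: the class name HopNode, its method names, and the in-memory dictionaries are hypothetical and are not part of the protocol description. A HACK is sent as soon as data arrives, while the data itself is released, and an aggregated ACK sent upward, only after every child has reported its own aggregated ACK.

```python
class HopNode:
    """Toy model of one ACK-tree node (all names are illustrative assumptions)."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = set(children)
        self.store = {}      # seq -> payload held in the secondary store
        self.agg_acks = {}   # seq -> children whose aggregated ACK has arrived
        self.sent_up = []    # log of (kind, seq) messages sent to the parent

    def on_data(self, seq, payload):
        # Acknowledge receipt right away with a HACK, but keep the data.
        self.store[seq] = payload
        self.sent_up.append(("HACK", seq))
        self._maybe_release(seq)

    def on_aggregate_ack(self, child, seq):
        # Record an aggregated ACK received from one child.
        self.agg_acks.setdefault(seq, set()).add(child)
        self._maybe_release(seq)

    def _maybe_release(self, seq):
        # Release only when this node holds the data and every child
        # has reported an aggregated ACK for it.
        acked = self.agg_acks.get(seq, set())
        if seq in self.store and acked >= self.children:
            del self.store[seq]
            self.agg_acks.pop(seq, None)
            self.sent_up.append(("AGG_ACK", seq))
```

In this sketch a leaf has no children, so it sends its aggregated ACK immediately after its HACK; an interior node holds the packet until the last child reports.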

The  following  two  additional  mechanisms  are  used  together  with 
aggregated  ACKs  to  ensure  that  a  disconnected  node  or  subtree  is 
never  allowed  to  rejoin  the  ACK  tree  after  the  source  has  erased  data 
from  memory  that  the  rejoining  node  or  subtree  never  received. 

First, a node that perceives one of its children as disconnected assumes
the reception of any pending aggregated ACK from that child
and sets a topology-change notification flag in its own aggregated
ACK. The setting of the flag is preserved as the aggregated ACK travels
back to the source. The flag instructs the source to wait for an
even  longer  period  of  time  before  erasing  the  associated  data  from 
memory  after  receiving  all  the  aggregated  ACKs  from  its  children. 

Second,  an  orphan  node  is  given  a  finite  amount  of  time  to  reconnect 
to  the  ACK  tree.  This  time  is  much  shorter  than  the  time  set  in 
a  “connect  timer”  at  the  source.  Once  an  orphan  node  times  out, 
it  cannot  join  the  session  and  catch  up  without  application-level 
support.  The  next  section  provides  more  details  on  the  handling 
of  orphans  within  the  context  of  Lorax. 

5  SHARED  ACK  TREES 

For  a  concurrent  multicast  session,  it  is  not  reasonable  to  manage  a 
separate  ACK  tree  for  every  source.  To  remedy  this,  Lorax  supports 
the  proper  dissemination  of  all  acknowledgments  in  a  multicast 
group  along  a  single  shared  ACK  tree  of  the  concurrent  multicast 
session.  The  routing  scheme  used  in  Lorax  is  adapted  from  a 
technique  developed  for  the  routing  of  messages  between  multiple 
processor elements described in [24].

Consider an ACK tree created for one original source, in which
nodes have at most B children. The tree can be re-hung as an acknowledgment
tree with any other node as the root, and all nodes
will still have at most B children. This is a well-known property
of trees that allows us to apply the constant-cost results of Section 3
to any protocol utilizing a shared tree for concurrent multicast sessions.

When  an  ACK  tree  is  created  for  a  single-source  multicast  session 
with  the  source  as  the  root  of  the  tree,  the  routing  of  aggregate  ACKs 
to  the  appropriate  hop  node  towards  the  source  is  simple:  each  node 
HACKS  to  its  designated  parent.  However,  the  situation  is  more 
complicated  for  shared  ACK  trees.  The  introduction  of  multiple 
sources  clashes  with  the  inherent  anonymity  of  the  tree:  receivers 
in  the  ACK  tree  lack  knowledge  of  where  each  source  is  located, 
knowledge  that  can  be  used  to  route  ACKs  to  the  appropriate  hop 
node leading to a particular source. In this paper we refer to the
actions a node takes to discover which adjacent node on the shared
ACK tree lies on the path to a particular source node on the shared
ACK tree as routing.

Routing  in  an  anonymous  ACK  tree  can  be  done  efficiently  using 
implicit  routing,  with  which  each  node  is  labeled  based  on  its 
position  in  the  tree.  All  packets  from  a  source  include  this  label, 
from  which  receivers  can  infer  to  which  child  or  parent  to  route 
towards  that  source.  One  such  labeling  scheme  is  presented  in  [25], 
but  the  algorithm  requires  the  entire  tree  to  be  relabeled  when  any 
node  is  added  or  deleted.  Our  adaptation  of  [24]  involves  only  two 
nodes  for  a  completely  new  addition  (the  added  node  and  its  parent), 
and  deletions  require  re-labeling  of  the  subtree  of  the  deleted  node 
when  patched  back  into  the  ACK  tree. 

First  we  describe  the  construction  of  the  shared  ACK  tree,  then  the 
labeling  and  routing  scheme.  We  then  describe  tree  maintenance, 
including  how  nodes  can  split  off  children  and  how  node  deletions 
are  handled. 

5.1  Ack  Tree  Construction 

Our  approach  assumes  the  existence  of  the  multicast  routing  tree(s) 
provided  by  the  underlying  multicast  routing  protocols.  In  the 
Internet,  these  trees  will  be  built  using  such  protocols  as  Distance 
Vector  Multicast  Routing  Protocol  (DVMRP)  [17],  Core  Based 
Trees  (CBT)  [19],  Ordered  Core  Based  Trees  (OCBT)  [20],  or 
Protocol Independent Multicast (PIM) [18].

To  construct  the  ACK  tree,  Lorax  utilizes  a  combination  of  root- 
based  and  off-tree  schemes  to  grow  the  tree.  These  schemes  are 
based  on  the  common  expanding  ring  search  (ERS)  technique  over 
the  underlying  multicast  routing  tree(s)  and  mechanisms  intended 
to  limit  the  cost  of  each  ERS. 

The  ACK  tree  is  grown  from  a  single  root  node  using  either  the 
source multicast routing tree of the root node or the common multicast
routing tree of the multicast group. The root node may be
selected before the session starts and advertised together with the
multicast address, or may be selected when the session begins using
an election algorithm.

After  joining  the  IP  multicast  address,  all  nodes  are  considered  off- 
tree  except  for  the  root  node  of  the  ACK  tree.  The  root  immediately 
begins  multicasting  invitation-to-join  messages  (INV)  using  the 
underlying  multicast  routing  tree  with  a  time-to-live  (TTL)  value 
of zero in the IP header, and sets a timer Tinv. An off-tree node
that  hears  an  INV  message  unicasts  a  request  (REQ)  to  be  adopted 
back  to  the  inviting  node.  If  an  inviting  node  does  not  hear  a  REQ 
before  Tinv  expires,  it  multicasts  a  new  INV  with  a  larger  TTL 
value  and  resets  Tinv  to  a  longer  timeout.  When  a  REQ  is  received 
correctly  at  the  inviting  node,  a  bind  message  (BIND)  is  sent  to  the 
new  child  confirming  the  adoption.  Once  the  new  child  receives  the 
BIND,  it  becomes  an  on-tree  node  and  starts  the  same  process  again 
by  multicasting  an  INV.  This  process  stops  at  any  on-tree  node  (i.e., 
a  node  that  is  “growing  the  ACK  tree”)  when  the  node  has  several 
children,  or  the  TTL  field  of  its  INV  reaches  a  maximum  value. 
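
The invitation loop above can be approximated by the following Python sketch. It is a toy model under assumptions that the text does not fix: the TTL is widened by one hop per round, an INV with a given TTL reaches exactly the off-tree nodes within that many hops, and every node reached answers with a REQ that is confirmed with a BIND. The function name and its parameters are hypothetical.

```python
def expanding_ring_invite(hops_to_offtree, max_children, max_ttl):
    """Return (children, ttl_used) for one inviting on-tree node.

    hops_to_offtree maps each off-tree node name to its hop distance
    from the inviting node on the multicast routing tree (assumed known
    here only so that the search can be simulated).
    """
    children = []
    ttl = 0
    while True:
        # An INV with this TTL reaches every off-tree node within `ttl`
        # hops; each one unicasts a REQ and is confirmed with a BIND.
        nearest_first = sorted(hops_to_offtree.items(),
                               key=lambda kv: (kv[1], kv[0]))
        for node, hops in nearest_first:
            if (hops <= ttl and node not in children
                    and len(children) < max_children):
                children.append(node)
        # Stop when the node has enough children or the ring is at its limit.
        if len(children) >= max_children or ttl >= max_ttl:
            return children, ttl
        ttl += 1  # assumed growth rule: widen the ring by one hop


children, ttl = expanding_ring_invite(
    {"a": 1, "b": 2, "c": 2, "d": 5}, max_children=3, max_ttl=4)
```

Timers are omitted in this sketch; in the scheme described above each widening step would follow the expiry of Tinv, which is then reset to a longer timeout.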

Note  that  the  maximum  TTL  of  an  INV  is  much  smaller  than  the 
TTL of data packets or the TTL needed to cover the entire underlying
multicast routing tree. This root-based strategy to create the
ACK tree is used to avoid excessive traffic over the multicast routing
tree. In practice, this scheme should suffice to create the ACK
tree,  because  most  if  not  all  members  of  a  reliable  multicast  group 
will  want  to  participate  in  the  ACK  tree  (i.e.,  receive  information 
reliably). 

In  the  unlikely  scenario  in  which  a  large  number  of  members  of 
the reliable multicast group do not want to receive information
reliably,  growing  the  ACK  tree  from  the  root  only  may  result  in  the 
formation  of  a  “frontier”  of  leaf  nodes  on  the  ACK  tree  that  may 
not  reach  nodes  interested  in  receiving  packets  reliably  but  who  are 
beyond  the  maximum  allowed  TTL  of  INVs  from  frontier  nodes.  To 
account  for  this  case,  Lorax  includes  an  off-tree  scheme  for  off-tree 
nodes  to  reach  the  ACK  tree. 

Allowing off-tree nodes to freely multicast until they find a parent
may cause the underlying multicast routing tree to become congested
with search messages. This method is similar to the single-source
tree construction method presented in [6]. Lorax solves this
problem by limiting the scope of ERS multicasts needed to reach the
ACK tree. More specifically, consider an off-tree node that joins the
multicast session and call it orphan node o. This node starts multicasting
query (QRY) messages looking for a new parent in an ERS
fashion after one of the following outcomes occurs: (a) a timeout
expires after joining the multicast session without the reception of
an INV message from a node in the ACK tree, or (b) having sent
its REQ a number of times, a timeout expires without receiving the
corresponding BIND.

When an off-tree node h1 receives node o's QRY, it responds to o
with a DIF message. Node o may receive multiple such replies, and
can pick any one of the responding nodes as its helper in joining
the ACK tree. If o chooses h1 as its helper, the two nodes then
periodically send nexus messages (NEXUS) to each other verifying
that there is a nexus from o to h1. A nexus is a directed connection
from o to h1 corresponding to o's attempt to reach the ACK tree; it
can be terminated only by o or a resource failure. Node o need
not multicast more QRYs as long as its nexus with h1 is valid.
Node h1 diffuses the ERS towards the ACK tree by multicasting
QRYs according to ERS. After a number of ERS attempts, node
h1's search may either be successful and reach the ACK tree, or be
unsuccessful and reach another off-tree node h2 willing to help. In
the latter case, a nexus from h1 to h2 is established, and h2 helps
diffuse node o's ERS towards the ACK tree.

Note  that  the  chain  established  by  the  diffusion  of  node  o's  ERS  by 
one  or  various  helpers  is  not  part  of  the  ACK  tree;  it  is  only  used  to 
contain  the  span  of  the  ERS  multicasts  needed  for  the  orphan  node 
to  reach  the  frontier  of  the  ACK  tree. 

Once  an  on-tree  node  hears  a  QRY  message  for  the  ERS  started 
by  node  o,  it  unicasts  a  response  message  (RSP)  to  node  o.  This 
message  indicates  that  the  sending  node  is  willing  to  adopt  the 
orphan  o.  All  on-tree  nodes  are  required  to  respond  to  any  QRY, 
and  nodes  that  end  up  having  more  children  than  they  can  handle 
go  through  a  process  of  fission,  described  subsequently. 


Figure  5:  An  example  of  the  labeling  scheme. 


An  orphan  node  may  receive  more  than  one  RSP,  in  which  case 
the  orphan  unicasts  a  REQ  to  one  of  the  on-tree  nodes  willing  to 
adopt  and  the  process  outlined  above  continues.  If  the  orphan  node 
already  started  an  ERS  to  join  the  ACK  tree,  it  also  sends  a  terminate 
message  (TERM)  to  its  helper  to  erase  the  nexus  it  had  established 
to reach the ACK tree. Any node actively helping node o that
receives a TERM on its incident nexus sends a TERM on its
outgoing nexus. This process continues until the chain started by
node  o  to  reach  the  ACK  tree  is  erased. 
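
The TERM propagation just described amounts to walking the helper chain and erasing one nexus per hop. The Python sketch below illustrates this under the assumption, made only for the example, that each node's outgoing nexus is stored in a dictionary keyed by (node, orphan); the function name is hypothetical.

```python
def terminate_chain(outgoing, start, orphan):
    """Erase the helper chain built for `orphan`, beginning at `start`.

    outgoing[(node, orphan)] is the node at the far end of `node`'s
    outgoing nexus for that orphan's search. Returns the nodes that
    received a TERM, in order.
    """
    notified = []
    node = start
    while (node, orphan) in outgoing:
        # Drop this nexus, then forward the TERM along it.
        nxt = outgoing.pop((node, orphan))
        notified.append(nxt)
        node = nxt
    return notified


# Chain o -> h1 -> h2 for orphan "o"; h1 also helps a second orphan "p".
nexus = {("o", "o"): "h1", ("h1", "o"): "h2", ("h1", "p"): "h3"}
term_path = terminate_chain(nexus, "o", "o")
```

Because entries are keyed per orphan, erasing the chain for one orphan leaves a helper's nexus for any other orphan untouched.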

A  node  declares  a  nexus  built  to  help  orphan  o  to  be  invalid  after 
timing  out  without  receiving  a  NEXUS  from  the  other  node  in  the 
nexus.  In  that  case,  the  helper  closer  to  the  orphan  node  o  (or  node 
o  itself)  starts  sending  QRYs  again  (i.e.,  it  attempts  to  get  a  new 
helper  to  reach  the  ACK  tree)  and  the  node  to  which  the  nexus  was 
incident  sends  a  TERM  on  its  outgoing  nexus  to  erase  the  rest  of  the 
chain  of  helpers. 

Note  that  a  node  may  participate  in  the  diffusion  of  multiple  ERSs, 
each for a different orphan. Each search is treated independently.
Once a node is on-tree, it must receive a label used in routing of
acknowledgments, for restructuring of the tree during node deletions,
and  for  the  fissioning  of  local  groups. 

5.2  Labeling  and  Routing 

The algorithm used in Lorax for routing HACKs and aggregate
ACKs is based on the following simple scheme: if a source is not
in a receiver's subtree, then HACKs should be sent to the parent;
otherwise, the HACK should be sent to the child who heads the
appropriate  subtree. 

Some  common  definitions  are  used  for  our  formal  description  of  the 
protocol.  As  an  initial  framework,  we  represent  the  network  as  an 
undirected graph G = (V, E), where V is the finite set of nodes,
and E ⊆ V × V is the set of edges representing the (currently
operational) bi-directional links between nodes. We require each
node x ∈ G to have a unique name n(x) (e.g., an IP host address).

Let T ⊆ G be the ACK tree over which acknowledgments are
routed. The protocol assigns a unique integer label l(x) to each
node x such that all nodes descendant from x contain l(x) as the
prefix of their respective labels. Figure 6 describes the hierarchical
labeling algorithm, where ∘ is the concatenation operator. When
the labeling algorithm terminates, every node in the ACK tree has
a unique label, as illustrated in Figure 5. This label is used by the
ACK routing algorithm also shown in Figure 6. Basically, a receiver
must check if the source is in its subtree. If it is, then the label of
the receiver node will be a prefix of the label of the source node.
The ACK routing algorithm routes packet h acknowledging source
S's data to the proper hop node for receiver R. Let |l(x)| denote the
cardinality of the label of node x, i.e., the number of bits it contains.
Each source node includes its label in all packets it transmits. It is
trivial for each receiver to store the label in the table containing
retransmission information (e.g., the last sequence number received
correctly). If at any point the source is assigned a new label, it must
be multicast to all the receivers.

LABEL(graph G)
    Construct a tree T ⊆ E from G rooted at some node s, as above;
    label l(s) = 1;
    Call LABEL-SUBTREE(s, T, 1);

LABEL-SUBTREE(node r, tree T, integer δ)
    Let ω = 0;
    For each child c of r do:
        label l(c) = δ ∘ ω;
        Let ω = ω + 1;
        Call LABEL-SUBTREE(c, T, l(c));

ROUTE-PACKET(tree T, node n, node s, packet p)
    If |l(n)| > |l(s)|
        Then route the packet to the parent of n;
    Else compare the first |l(n)| low order bits of l(n) and l(s);
        If they are not equal
            Then route the packet to the parent of n;
        Else the next ⌈log2 B⌉ bits explicitly state to which
            child of n the packet should be routed.

Figure 6: Algorithms for ACK tree labeling and routing of ACKs.

On average, nodes closer to the root of the tree have to compare
fewer bits than leaf nodes. The cardinality of the label
grows slowly with an increasing receiver set. Each level
adds an additional (log2 B) bits, and a tree of n receivers has
(logB n) levels. The number of bits needed is therefore exactly
(log2 B)(logB n) = log2 n. Consequently, if 32 bits were used in
the packet headers for this label, then a tree could handle 2^32 nodes.
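
The prefix labeling and the routing decision can be illustrated with a short Python sketch. It is a hedged illustration, not the algorithm of Figure 6 verbatim: labels are modeled as bit strings whose leading bits form the prefix, each level adds max(1, ⌈log2 B⌉) bits, and the case where the receiver is itself the source returns None; these choices, like the function names, are assumptions made for the example.

```python
from math import ceil, log2


def label_tree(children, root, B):
    """Assign bit-string labels so each node's label prefixes its descendants'."""
    width = max(1, ceil(log2(B)))          # bits added per tree level
    labels = {root: "1"}
    stack = [root]
    while stack:
        r = stack.pop()
        for w, c in enumerate(children.get(r, [])):
            # Child label = parent label concatenated with the child index.
            labels[c] = labels[r] + format(w, "0{}b".format(width))
            stack.append(c)
    return labels


def route(labels, children, n, s, B):
    """Next hop from n toward source s: 'parent', a child name, or None if n is s."""
    width = max(1, ceil(log2(B)))
    ln, ls = labels[n], labels[s]
    if ln == ls:
        return None                        # assumed extra case: n is the source
    if len(ln) > len(ls) or not ls.startswith(ln):
        return "parent"                    # source is not in n's subtree
    w = int(ls[len(ln):len(ln) + width], 2)
    return children[n][w]                  # child heading the source's subtree


tree = {"r": ["a", "b"], "a": ["c", "d"], "b": ["e"]}
labels = label_tree(tree, "r", B=2)
```

With B = 2 each level adds one bit, so a node at depth d carries a label of d + 1 bits in this sketch (the extra bit is the root's leading 1).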

5.3  Tree  Maintenance 

5.3.1 Group Fission The analysis of Section 3 assumes a constant
B which bounds the degree of each node in the tree. In practice,
the value of B can be chosen independently at each node; i.e.,
some machines are more capable than others. For reasonable performance,
nodes should not set B so low that only a few nodes can
be  supported  as  children,  and  B  should  not  be  set  any  higher  than 
the  nodes  can  support  efficiently. 

It  is  clear,  then,  that  an  algorithm  is  needed  to  keep  the  number  of 
children  at  or  below  B  at  each  node;  we  refer  to  this  process  as 
fission.  Arbitrarily  assigning  a  new  parent  to  children  would  not 
preserve  grouping  well,  and  so  we  propose  again  using  the  ERS 
heuristic for fissioning a group. An easy heuristic is to simply
disconnect extra children and let them ask nodes in other local
groups for adoption; however, this may create unscalable amounts
of work at some nodes.

Our fission algorithm requires that each parent node keep track of how
many additional nodes each of its current children may take. This
information is easily included periodically in HACKs, or as part of a NAPP
algorithm. The parent node sends an adopt message (ADOPT) to
the child with the most free space. The ADOPT forces the adopting
node to multicast a request-for-children message (RFC) to all the
nodes in its local group in the ACK tree. The nodes that respond
first with a QRY message are currently closer; they are sent an RSP
message by the adopting node, and the fission is completed. From
here, the labeling algorithm must be run on the new subtrees of the
adopting nodes.

The  question  remains  of  how  many  new  children  to  force  onto  the 
adopting  node.  Initially  it  is  advantageous  to  just  reduce  the  number 
of  nodes  to  slightly  below  B.  If  three  fissions  happen  close  to  each 
other, measured by a timer Tgrowth, a heuristic is to require that the
third  fission  reduce  the  number  of  nodes  down  to  half  B,  and  the 
timer  is  reset.  This  drastic  fissioning  is  motivated  by  the  fact  that  if 
many  fissions  happen  in  such  a  short  period  of  time,  then  the  tree  is 
most  likely  in  a  period  of  growth,  and  needs  to  be  expanded.  Each 
additional fission within the Tgrowth period also separates half the
children  and  resets  the  timer.  When  no  fissions  happen  within  a  full 
Tgrowth  period,  the  node  initializes  to  small  fissions  again. 
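
One way to read the fission-sizing heuristic is as a small state machine, sketched below in Python. Several details are assumptions made for illustration: "slightly below B" is taken to mean B - 1, the Tgrowth timer is started at the first fission of a burst, and time is passed in explicitly as a number. The class and method names are hypothetical.

```python
class FissionPolicy:
    """Decide how many children a node keeps after each fission."""

    def __init__(self, B, t_growth):
        self.B = B
        self.t_growth = t_growth
        self.window_start = None  # when the Tgrowth timer was last (re)started
        self.count = 0            # fissions seen since the timer was started
        self.drastic = False      # True while the tree appears to be growing

    def children_to_keep(self, now):
        # No fission within a full Tgrowth period: back to small fissions.
        if self.window_start is None or now - self.window_start > self.t_growth:
            self.window_start = now
            self.count = 0
            self.drastic = False
        self.count += 1
        if self.drastic or self.count >= 3:
            # Third fission of a burst, or any later one inside Tgrowth:
            # cut down to half of B and reset the timer.
            self.drastic = True
            self.window_start = now
            self.count = 0
            return self.B // 2
        return self.B - 1         # assumed meaning of "slightly below B"


policy = FissionPolicy(B=8, t_growth=10)
kept = [policy.children_to_keep(t) for t in (0, 3, 5, 12, 30)]
```

In the usage example, the fissions at times 0 and 3 are small, the third at time 5 is drastic, the one at time 12 falls inside the reset timer and is drastic as well, and the one at time 30 follows a quiet period and is small again.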

5.3.2  Deletions  We  use  the  same  algorithm  for  accidental  and 
intentional  deletion  of  nodes  from  the  ACK  tree;  only  the  initiating 
conditions  are  different.  The  algorithm  is  motivated  by  the  need  for 
a  fast  distributed  algorithm  that  would  not  force  all  disconnected 
nodes  to  be  children  of  one  parent,  causing  fission.  We  assume  that 
all  children  of  a  deleted  node  are  operating  close  to  their  B  limit, 
and  cannot  take  on  B  more  children. 

To  describe  a  simple  method  of  restructuring  the  ACK  tree,  we 
present  the  following  relation.  Define  “at  least  as  old  as”,  with 
operator “≥” for two children x, y of a common parent: x is at
least as old as y if integer values l(x) ≥ l(y).

The  algorithm  is  simple  and  starts  when  the  parent  node  multicasts 
a  deletion  message  (DEL)  to  the  members  of  its  local  group,  or 
when  a  node  detects  that  it  is  an  orphan.  Since  all  nodes  have  labels 
before  the  deletion  starts,  all  even-labeled  nodes  become  the  child 
of the next lowest (that is, next eldest) even-labeled node. All
odd-labeled nodes become the child of the next lowest (even-)labeled node.
Since all nodes have a list of their siblings, unicast QRYs are sent
directly to the proper node, and an RSP is expected in response (a
BIND is not required). If new parents do not respond, then the node
joins  the  ACK  tree  as  if  it  were  a  new  node. 
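
The re-parenting rule can be sketched in Python as below. The sketch rests on one reading of the text, stated here as an assumption: siblings are compared by their integer labels, a lower label means an older node, and both even- and odd-labeled nodes attach to the nearest lower even-labeled sibling. The function name is hypothetical.

```python
def reparent_siblings(sibling_labels):
    """Map each orphaned sibling label to its new parent's label.

    A value of None means no lower even-labeled sibling exists, so that
    node (normally the eldest) must search for a parent itself.
    """
    labels = sorted(sibling_labels)
    evens = [x for x in labels if x % 2 == 0]
    new_parent = {}
    for x in labels:
        # Nearest even-labeled sibling strictly below x, if any.
        lower_evens = [e for e in evens if e < x]
        new_parent[x] = lower_evens[-1] if lower_evens else None
    return new_parent


plan = reparent_siblings([0, 1, 2, 3, 4])
```

Under this reading each even-labeled sibling gains at most two new children, which fits the stated assumption that the children of a deleted node are already close to their B limit.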

While this algorithm is completing, the eldest node starts multicasting
QRYs to all nodes in the multicast group using the group's
multicast address. Note that as long as this node does not join with
descendants, the partial ordering of the tree is preserved. In support
of this, we also require that nodes retain their label until a new
parent is found. The reason is that the eldest node may join with
a disconnected node previously in its subtree. A loop is formed if
that disconnected node then re-joins with a node in the eldest node's
subtree before it can be relabeled. As long as the disconnected node
retains the old label, this scenario cannot happen.

Because  all  other  maintenance  in  the  tree  is  ERS-heuristic-driven, 
it  is  likely  that  the  new  parents  are  close  in  the  network  topology. 
However,  this  is  not  guaranteed,  and  a  node  may  wish  to  keep  a 
counter  of  how  many  times  its  parent  has  been  deleted;  it  can  then 
rejoin  the  ACK  tree  as  a  new  node  when  a  certain  value  has  been 
exceeded. 

When  the  root  of  the  ACK  tree  becomes  disconnected  from  the  tree, 
the  eldest  child  becomes  the  new  root. 

5.3.3 Orphans If a node has lost contact with its parent for a
time Torphan, long enough so that the cause is clearly not congestion,
it considers itself an orphan, and will have to rejoin the ACK
tree by initiating the method described above. Clearly, the descendants
of the orphaned node do not have to rejoin the ACK tree, but
must be relabeled with the orphaned node's new prefix.

When  a  node  is  orphaned  it  may  choose  to  enact  the  secondary 
acknowledgment  protocol  described  in  Section  4.  The  only  thing  an 
orphan  must  do  is  contact  each  source  in  the  ACK  tree,  instructing 
them  that  the  node  might  be  orphaned,  and  to  not  delete  data  until  it 
receives  a  new  parent.  If  there  are  enough  sources,  the  orphan  may 
choose  for  efficiency  to  multicast  this  information.  A  node  may 
choose to contact sources at time Tparanoid < Torphan to be sure
the hold message reaches the source in enough time. If the sources
are following the secondary protocol, then normally they will not
delete data until the second ACK is received from all children, or
a certain very long timeout Tsource has been reached. In practice,
nothing is guaranteed, but if Tsource is much longer than Torphan
we  expect  the  protocol  to  work. 

Showing that Lorax is loop-free is simple, because the relation
“≥” is reflexive, transitive, and anti-symmetric, and is therefore a
partial ordering. Let x → y denote that x is the parent of y because
there exists an edge in the tree T between x and y. Assume that at some
time during the operation of Lorax, there is a path of nodes in the
ACK tree such that a → b → ··· → c → a, and a ≠ b ≠ c. Lorax
requires that x ≥ y if x is the parent of y; by transitivity, it
follows that both a ≥ c and c ≥ a must be true. The only
way in which this can be true is if a = c, which is a contradiction,
and it follows that Lorax produces loop-free routing of ACKs at all
times.

6  COMPARISON  WITH  RELATED  WORK 

As  we  have  summarized  in  our  taxonomy  of  Section  2,  there  is 
a growing body of work on reliable multicast protocols for internetworks.
Our results in Section 3 clearly indicate that tree-based
protocols  are  the  first  choice  in  terms  of  performance,  with  RINA 
protocols  being  the  second.  Not  surprisingly,  RINA  and  tree-based 
protocols  are  the  two  prominent  approaches  for  the  implementation 
of  reliable  multicasting  today. 

The  main  motivation  for  RINA  protocols  is  that  using  NAKs  frees 
the  sender  from  having  to  process  every  ACK  from  each  receiver. 
Two additional advantages are that the source is not supposed to
know the receiver set and the receivers pace the source. However,
RINA protocols suffer from a number of limitations.

First,  the  RINA  protocols  that  have  been  proposed  to  date  (e.g., 
SRM  [11]  and  LBRM  [12])  have  no  mechanism  for  the  source  to 
know when it can safely release data from memory [10]. LBRM
uses  a  hierarchy  of  log  servers  that  store  information  indefinitely 
and  receivers  recover  by  contacting  a  log  server.  Using  log  servers  is 
feasible  only  for  applications  that  can  afford  the  servers  and  leaves 
many  issues  unresolved.  If  a  single  server  is  used,  performance 
can  degrade  due  to  the  load  at  the  server;  if  multiple  servers  are 
used,  mechanisms  must  still  be  implemented  to  ensure  that  such 
servers  have  consistent  information.  On  the  other  hand,  SRM 
simply  requires  that  data  needed  for  retransmission  be  rebuilt  from 
the  application.  Since  the  application  is  never  informed  when  data 
has  been  successfully  delivered  to  all  receivers,  all  data  is  stored  at 
all  sources  (and  at  all  willing  receivers)  for  the  length  of  the  session. 

Second,  if  error  recovery  in  a  RINA  protocol  depends  solely  on 
timeouts  at  the  receivers,  end-to-end  delays  can  become  arbitrarily 
large. For example, SRM requires every receiver to multicast periodic
“session messages” specifying the highest sequence number
accepted from a source and a time-stamp used by the receivers to estimate
the delay from the source. The sequence number in a session
message is in effect an ACK to the last packet from the source, and a
receiver can keep “polling” the source periodically to ensure that the
source eventually delivers missing packets not caught by the NAK
scheme. This clearly limits the scalability of SRM, because the persistence
of session messages forces every node to know the receiver
set. 

Third,  NAKs  and  retransmissions  must  be  multicast  to  the  entire 
multicast group to allow suppression of NAKs. NAK-avoidance
was designed for a limited scope, such as a LAN, or the small
number of Internet nodes that can be expected in a local group of
an ACK tree. This is because the basic NAK-avoidance algorithm
requires that timers be set based on updates multicast by every node.
As the number of nodes increases, each node must do an increasing
amount of work. Even worse, nodes that are on congested links,
LANs  or  regions  may  constantly  bother  the  rest  of  the  multicast 
group  by  multicasting  NAKs. 

On the other hand, tree-based protocols eliminate the ACK implosion
problem and free the source from having to know the receiver
set, provide maximum end-to-end delays that are bounded, and operate
solely on messages exchanged in local groups (between a node
and its children in the ACK tree). As we show in Section 3, the
amount of work required at each node for tree-NAPP protocols does
not increase with the number of group members, i.e., the throughput
of such protocols is not dependent on the number of group members.

The only two concerns regarding the practicality of tree-based protocols
are whether finite memory can be used and the effort needed
to build and maintain a “reasonable” structure for the ACK tree that
can be modified in a dynamic and scalable manner. Our approach
addresses both of these concerns. First, we have shown in Section 4
how to make tree-based protocols work correctly with finite memory.
Second, Lorax is the first protocol that provides a shared ACK tree
for efficient use among multiple sources and receivers in a concurrent
multicast session, and the ACK tree is maintained dynamically in
the presence of changes to the receiver set or the underlying
multicast routing tree.

Figure 7: Optimality with 95% confidence intervals
shown as vertical lines.

7  QUALITY  OF  ACK  TREES 

Throughout this paper we have described the construction of ACK
trees making no assumptions regarding the structure of the underlying
multicast routing tree(s). However, there is much to be gained
by using a shared multicast routing tree such as one created by CBT or
OCBT. With such an underlying routing tree, packets are multicast
from each source to the receivers through the same structure; receivers
closer to the source receive packets before receivers down
the tree do, and there is a correlation of packet loss at nodes hanging
from the multicast routing subtree of a router. The more these
relationships  are  preserved  in  the  ACK  tree,  the  better  the  ACK  tree 
performs,  because  latencies  and  retransmissions  within  each  local 
group  of  the  ACK  tree  have  a  direct  correspondence  with  delays, 
congestion,  and  errors  that  occur  in  the  routing  tree. 

We  define  an  ACK  tree  as  optimal  if,  for  all  paths  in  the  underlying 
multicast  routing  tree  that  start  from  the  router  adjacent  to  a  parent 
node  in  the  ACK  tree  and  terminate  at  a  router  adjacent  to  its  child 
node  in  the  ACK  tree,  any  receivers  adjacent  to  a  router  lying  on  that 
path  necessarily  are  children  of  the  same  parent  node  in  the  ACK 
tree.  Unfortunately,  obtaining  an  optimal  ACK  tree  may  be  at  odds 
with  the  number  of  children  in  the  ACK  tree  that  any  given  host  can 
support  in  practice. 

To gain insight into the optimality of the ACK trees built with Lorax,
we performed a number of simulations.² A single routing tree was
created using a simulation of CBT running on top of the Distributed
Bellman-Ford algorithm in a network of 25 nodes. The routing tree
has its core (root) at node 10, and each routing node in the tree
has a maximum degree of 6. A node of the ACK tree was attached
to each routing node, and each such node was selected in turn as the
root of the ACK tree. For each placement of the ACK tree root, each
node of the ACK tree was allowed a maximum degree of 3, 4, 5, or 6,
and Lorax was run to obtain the corresponding ACK tree in each
of the 100 cases (25 root placements times 4 degree limits). For each
ACK tree obtained by Lorax, we counted the number of nodes in the
ACK tree that adhere to our definition of ACK tree optimality. An
ACK tree node adheres to our optimality principle if its router is a
descendant (on the routing tree) of the router of its ACK tree parent.

² We thank Rooftop Communications Corporation for donating the C++
Protocol Toolkit.
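
The per-node adherence check described above can be sketched as follows. This is a minimal illustration, not the simulation code used for the experiment. It assumes one ACK tree node per router (as in the simulation), represents each tree as a hypothetical child-to-parent map, assumes the routing tree is oriented from a single root, and excludes the ACK tree root from the count; the names `is_descendant` and `adherence_fraction` are introduced here for illustration only.

```python
def is_descendant(routing_parent, node, ancestor):
    """Return True if `node` lies strictly below `ancestor` in the routing tree."""
    current = routing_parent.get(node)
    while current is not None:
        if current == ancestor:
            return True
        current = routing_parent.get(current)
    return False


def adherence_fraction(routing_parent, ack_parent):
    """Fraction of non-root ACK tree nodes whose router is a routing-tree
    descendant of the router of their ACK tree parent.

    Assumption: node identifiers are shared by both trees, i.e. exactly
    one ACK tree node is attached to each router.
    """
    non_root = [n for n, p in ack_parent.items() if p is not None]
    if not non_root:
        return 1.0
    adhering = sum(
        is_descendant(routing_parent, n, ack_parent[n]) for n in non_root
    )
    return adhering / len(non_root)


# Hypothetical 6-node routing tree rooted at node 10.
routing_parent = {10: None, 3: 10, 7: 10, 1: 3, 2: 3, 8: 7}
# Hypothetical ACK tree: node 2 hangs from node 7 although its router
# sits below router 3, so node 2 does not adhere.
ack_parent = {10: None, 3: 10, 7: 10, 1: 3, 2: 7, 8: 7}

print(adherence_fraction(routing_parent, ack_parent))  # 4 of 5 non-root nodes adhere
```

Whether the ACK tree root should be counted, and how the routing tree is oriented when the ACK tree root differs from the core, are not specified in the text; both choices above are assumptions.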

The results from this simulation experiment indicate that Lorax
tends to build an ACK tree that is optimal according to our definition.
Lorax always built optimal ACK trees when nodes in the ACK
tree could support up to 6 neighbors, and in the remaining cases it
built ACK trees with more than 90% of their nodes adhering to the
optimality principle. The simulation results are graphed in Figure 7.
The type of topologies considered in [11] to analyze SRM's performance
corresponds to the case in which Lorax produces optimal
ACK trees, and Lorax with a tree-NAPP protocol should provide
performance better than or at least equal to the best performance
that can be expected from SRM.
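
Figure 7 reports 95% confidence intervals, but the text does not state how they were computed. One common way to obtain such an interval from the per-placement adherence fractions is the normal approximation to the mean, sketched below. The function name `mean_confidence_interval` and the sample values are hypothetical, and the choice of the normal approximation is an assumption.

```python
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(samples, confidence=0.95):
    """Normal-approximation confidence interval for the sample mean."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    half_width = z * stdev(samples) / len(samples) ** 0.5
    center = mean(samples)
    return center - half_width, center + half_width


# Hypothetical adherence fractions for a few ACK tree root placements.
fractions = [0.92, 1.0, 0.96, 1.0, 0.88]
low, high = mean_confidence_interval(fractions)
print(low, high)
```

With only 25 root placements per degree limit, a Student-t critical value would give a slightly wider interval than the normal approximation used here.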

When the node degree needed for an optimal ACK tree exceeds the
maximum degree that can be supported by nodes in the ACK tree,
the structure of the ACK tree deviates from the routing tree structure.
This corresponds to the case in which receivers of the multicast
group are sparsely distributed over a routing tree.

8  CONCLUSION 

Using a maximum throughput model, we have established that tree-NAPP
protocols perform better than all other classes of reliable multicast
protocols. We have also presented solutions
to  several  open  questions  concerning  the  implementation  of  shared 
ACK  trees.  Preserving  reliability  during  restructuring  of  the  ACK 
tree  is  easily  guaranteed  using  aggregated  acknowledgments  that 
propagate  from  each  leaf  towards  the  source.  It  is  not  necessary 
to  use  aggregate  ACKs  in  conjunction  with  a  tree-based  congestion 
window  scheme.  It  is  possible  to  use  a  (shared)  tree  for  deallo¬ 
cating  memory  and  an  unstructured  receiver-initiated  scheme  for 
retransmission  requests.  This  class  of  tree-based  receiver-initiated 
NAK-avoidance  (TRINA)  protocols  can  be  viewed  as  an  extension 
to  RINA  protocols  like  SRM  and  LBRM  so  that  packets  can  be 
deleted  safely.  Our  future  work  continues  to  define  instances  of  this 
new  subclass  of  protocols. 
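
The role of aggregated acknowledgments can be illustrated with a short sketch. It assumes, as one plausible reading of the scheme, that a node's aggregate ACK is the minimum of its own highest in-order sequence number and the aggregate ACKs of its children. The function name `aggregate_ack`, the node names, and the sequence numbers are hypothetical.

```python
def aggregate_ack(node, children, highest_in_order):
    """Highest sequence number received in order by `node` and by every
    receiver in the ACK subtree rooted at `node`."""
    result = highest_in_order[node]
    for child in children.get(node, []):
        result = min(result, aggregate_ack(child, children, highest_in_order))
    return result


# Hypothetical state: highest in-order sequence number at each receiver.
highest_in_order = {"root": 12, "a": 10, "b": 11, "c": 7}

# ACK tree before restructuring: c is a child of a.
before = {"root": ["a", "b"], "a": ["c"]}
# ACK tree after restructuring: c has moved under b.
after = {"root": ["a", "b"], "b": ["c"]}

print(aggregate_ack("root", before, highest_in_order))
print(aggregate_ack("root", after, highest_in_order))
```

Under this assumption the value that reaches the root depends only on the set of receivers in the tree, not on how they are arranged, so a packet is never released while some receiver still lacks it, provided every receiver remains attached during restructuring.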

Lorax  maintains  scalable  operation  with  multiple  sources  by  con¬ 
structing  and  maintaining  a  shared  ACK  tree.  Overhead  traffic  is 
contained  during  initial  ACK  tree  construction  by  growing  the  ACK 
tree  from  a  known  root.  Impatient  nodes  are  quieted  down  and  al¬ 
lowed  to  join  the  ACK  tree  by  means  of  expanded  ring  searches  that 
are  narrow  in  scope.  Hierarchical  labeling  of  each  node  makes  im¬ 
plicit  routing  of  acknowledgments  simple  and  preserves  loop-free 
routing  of  such  acknowledgments  over  the  ACK  tree  at  all  times. 
For the case in which a shared multicast routing tree is used at the
network layer, the ACK trees built with Lorax mirror the multicast
routing tree.
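
To illustrate why hierarchical labels make acknowledgment routing implicit and loop-free, the sketch below assumes a prefix-style labeling in which each child's label extends its parent's label by one component. This is an assumption made for illustration and is not a reproduction of the labeling scheme of Section 5.2; the names `next_hop` and `route` are hypothetical.

```python
def next_hop(current, destination):
    """Label of the next ACK tree node on the path from `current` to
    `destination`, or None if `current` is the destination.

    Assumption: labels are tuples, the root is (), and a child's label
    is its parent's label plus one extra component.
    """
    if current == destination:
        return None
    if destination[:len(current)] == current:
        # Destination is in our subtree: descend toward it.
        return destination[:len(current) + 1]
    # Otherwise the destination is elsewhere: go up to our parent.
    return current[:-1]


def route(start, destination):
    """Full sequence of labels visited from `start` to `destination`."""
    path = [start]
    while (hop := next_hop(path[-1], destination)) is not None:
        path.append(hop)
    return path


# An acknowledgment from node (1, 2) travels to a source labeled (0,).
print(route((1, 2), (0,)))
```

Each hop is computed from the two labels alone, with no routing table, and each hop moves one step closer to the destination in the tree, so under this assumption no acknowledgment can loop.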

Although our empirical evidence shows that Lorax creates ACK
trees that are reasonably close to an underlying shared multicast
routing  tree,  changes  in  routing  tables  and  group  membership  can 
make  the  two  trees  differ  from  one  another  over  time.  Furthermore, 
more  efficient  mechanisms  could  be  adopted  in  Lorax  if  hosts  were 
allowed  to  know  more  about  the  structure  of  the  underlying  multi¬ 
cast  routing  trees.  Our  work  continues  to  address  the  opportunities 
presented  by  the  hierarchical  labeling  of  routers,  namely  the  ability 
to  provide  a  directed-multicast  service  over  an  existing  IP-multicast 
routing  tree.  With  a  small  change  in  the  protocols  now  being  pro¬ 
posed  for  the  creation  of  multicast  routing  trees,  Lorax  can  make  in¬ 
telligent  choices  when  constructing  and  maintaining  the  ACK  tree. 
Multicast  routing  protocols  such  as  CBT  and  OCBT  can  create  a 
single  tree  from  which  an  arbitrary  root  node  can  easily  be  picked 
(e.g.,  one  of  the  cores)  to  start  the  labeling  algorithm.  It  is  trivial  to 
incorporate  the  labeling  scheme  presented  in  Section  5.2  into  these 
multicast routing protocols. Forthcoming publications define this
service and the associated protocols more formally, and address the
dynamics of Lorax in large multicast groups.